Jupyter notebook

Before starting, let's take a look at the Jupyter notebook.

  1. Stopping and halting a kernel
  2. Looking at which notebooks are running
  3. Cells
  4. Adding cells above and below
  5. Changing the type of a cell from Markdown to Code
  6. Adding math (see the example below)
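
For example, Markdown cells render LaTeX, so typing a formula between dollar signs in a Markdown cell produces a nicely typeset equation (a minimal example):

$$ \hat{y} = \beta_0 + \beta_1 x $$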

Classes and objects

To import a module, you use the keyword import followed by the name of the module.


In [ ]:
import sklearn

You are able to import this because the module sklearn is already part of the Anaconda distribution. You can explore the modules that are part of sklearn by doing from sklearn import and then pressing Tab.


In [ ]:
# try it below
from sklearn import 

# this also works with submodules
from sklearn.linear_model import

In [ ]:
# from the submodule linear_model, let's import LinearRegression
from sklearn.linear_model import LinearRegression

Python is based on object-oriented programming (OOP).

  • Objects are containers of data and functionality
  • Objects belong to a class, and that class may inherit functionality from other classes
  • A class defines how the objects of that class store data and how those objects behave (see the sketch below)
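
As a minimal sketch of these ideas (the class names here are made up for illustration and are not part of sklearn):


In [ ]:
# a tiny, hypothetical class hierarchy: data + behavior, with inheritance
class Animal:
    def __init__(self, name):
        self.name = name              # data stored on the object

    def speak(self):                  # behavior shared by all Animals
        return "..."

class Dog(Animal):                    # Dog inherits functionality from Animal
    def speak(self):                  # and overrides how it behaves
        return self.name + " says woof"

rex = Dog("Rex")                      # rex is an object of class Dog
rex.speak()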

The imported LinearRegression is a class definition. You can find the parent classes of a class by retrieving its __bases__ attribute


In [ ]:
LinearRegression.__bases__

To create an object, you call the class with parameters. To retrieve the possible parameters of a class (or function) in the notebook, you can press Shift-Tab (preview), Shift-Tab twice (expanded window), three times (expanded window with no timeout), or four times (split view of the help)


In [ ]:
# try it below
LinearRegression()

Now, let's create a linear regression object


In [ ]:
lr = LinearRegression()

Again, we can explore that object by typing the name of the object, then ., and then Tab


In [ ]:
# try it here
lr.

If we type lr into the notebook, we will get a customized description of the object


In [ ]:
lr

We can obtain the class of the object programmatically by calling the built-in type function


In [ ]:
type(lr)

Now, every object has a unique identity, which we can retrieve with the built-in id function


In [ ]:
id(lr)

Datasets

sklearn ships with many datasets. We will use its diabetes dataset


In [ ]:
from sklearn.datasets import load_diabetes

In [ ]:
diabetes_ds = load_diabetes()

In [ ]:
X = diabetes_ds['data']
y = diabetes_ds['target']

sklearn works mostly with numpy arrays, which are $n$-dimensional arrays.


In [ ]:
[type(X), type(y)]

Numpy arrays

You can check the number of dimensions of an array


In [ ]:
X.ndim

Check the size of the dimensions


In [ ]:
X.shape

You can take slices along the dimensions. The following are all equivalent: they grab the first two rows of the matrix


In [ ]:
X[0:2]

In [ ]:
X[:2]

In [ ]:
X[0:2, :]

We can also grab columns in the same way


In [ ]:
X[:, 0:2]

Sometimes you want to grab just one column (feature), but then numpy returns a one-dimensional object


In [ ]:
X[:, 2].shape

We can reshape the ndarray and add one dimension:


In [ ]:
X[:, 2].reshape([-1, 1])

In [ ]:
X[:, 2].reshape([-1, 1]).shape

You can do matrix algebra:


In [ ]:
# transpose
X.T.shape

In [ ]:
X.dot(X.T).shape

For more functions, you can import numpy's linear algebra module


In [ ]:
import numpy.linalg as la

In [ ]:
la.inv(X.dot(X.T)).shape

Fitting models

OK, let's go back to our example with linear regression.

Usually, sklearn objects start by fitting the data and then either predict or transform new data. Predicting is usually for supervised learning and transforming for unsupervised learning.
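
For example, an unsupervised transformer such as StandardScaler follows the same pattern, but with transform instead of predict (a minimal sketch using the X we already loaded):


In [ ]:
# fit/transform with an unsupervised transformer (sketch)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)                   # learn the column means and standard deviations
X_scaled = scaler.transform(X)  # apply the learned scaling
X_scaled.shape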


In [ ]:
# explore the parameters of fit
lr.fit

In [ ]:
lr2 = lr.fit(X[:, [2]], y)

fit returns an object. If we examine the id of the object it returns:


In [ ]:
id(lr2)

In [ ]:
id(lr)

We realize that it is the same object lr: the call fits the data, modifies the internal state of the object, and returns the object itself.

Therefore, you can chain calls, which is a very powerful feature.

Explore the fitted object

By looking at the online documentation of LinearRegression, we can see which attributes hold the parameters it found.


In [ ]:
lr.intercept_

In [ ]:
lr.coef_

Predicting


In [ ]:
# explore the parameters
lr.predict

In [ ]:
y_pred = lr.predict(X[:, [2]])

Because we know how linear regression works, we can produce the predictions ourselves
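
That is, the prediction is just a linear function of the features,

$$ \hat{y} = \beta_0 + X \beta $$

where $\beta_0$ is stored in lr.intercept_ and $\beta$ in lr.coef_.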


In [ ]:
y_pred2 = lr.intercept_ + X[:, [2]].dot(lr.coef_)

In [ ]:
# numpy itself has not been imported yet, only numpy.linalg
import numpy as np

# this checks that all entries in the comparison are True
np.all(y_pred2 == y_pred)

Now, due to the powerful concept of chaining, we can combine fit and predict in one line


In [ ]:
y_pred3 = lr.fit(X[:, [2]], y).predict(X[:, [2]])

In [ ]:
np.all(y_pred3 == y_pred)

Additional packages

Sometimes you want to use a package that you found online. Many of these packages are available through pip, the Python package installer.

For example, the package quandl allows quants to load financial data in Python.

We can install it in the console simply by typing

pip install quandl

And now we should be able to import that package


In [ ]:
import quandl

In [ ]:
import quandl
mydata = quandl.get("YAHOO/AAPL")

In [ ]:
mydata.head()

Pandas


In [ ]:
# this renders plot results inline in the notebook
%matplotlib inline

Pandas is a package for loading, manipulating, and displaying data sets. It tries to mimic the functionality of data.frame in R


In [ ]:
import pandas as pd

Many packages return data in pandas DataFrame objects


In [ ]:
apple_stocks = quandl.get("YAHOO/AAPL")

In [ ]:
type(apple_stocks)

We can display the beginning and the end of a data frame:


In [ ]:
apple_stocks.head()

In [ ]:
apple_stocks.tail()

We can also plot it with pandas


In [ ]:
apple_stocks.plot(y='Close');

We can manipulate it too. Let's say we want to compute the stock returns

$$ r_t = \frac{V_t - V_{t-1}}{V_{t-1}} = \frac{V_t}{V_{t-1}} - 1 $$

For this, pandas provides the pct_change method:


In [ ]:
apple_stocks[['Close']].pct_change().head()
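
As a sanity check, pct_change is just the current value divided by the previous one, minus one, which we can compute ourselves with shift (a small sketch):


In [ ]:
# manual version of pct_change: divide by the previous row and subtract 1
close = apple_stocks['Close']
(close / close.shift(1) - 1).head()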

In [ ]:
apple_stocks[['Close']].pct_change().plot();

In [ ]:
apple_stocks[['Close']].pct_change().hist(bins=100);

Spark

Spark is a distributed, in-memory big-data analytics framework. It is Hadoop on steroids.

Because we launched this Jupyter notebook with pyspark, a Spark context variable called sc is automatically available; it gives us access to the master and therefore to the workers.

If we open the Spark dashboard (usually on port 4040), we can see some of these variables.

With the Spark context you can read data from many sources, including HDFS (the Hadoop Distributed File System), Hive, Amazon's S3, files, and databases.
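
For example, reading a text file from HDFS into an RDD is a one-liner (the path below is a placeholder, not a real dataset):


In [ ]:
# hypothetical example: the HDFS path is a placeholder, replace it with your own
lines_rdd = sc.textFile("hdfs:///path/to/some/file.txt")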


In [ ]:
# explore the variables and functions available in the Spark context
sc

Spark traditionally works with RDDs (Resilient Distributed Datasets), and more recently it is moving towards DataFrames, which are similar to pandas data frames but distributed.


In [ ]:
rdd_example = sc.parallelize([1, 2, 3, 4, 5, 6, 7])

We can check the id of the RDD in the cluster


In [ ]:
rdd_example.id()

In [ ]:
# this is an RDD
type(rdd_example)

Let's explore the functions we have available


In [ ]:
# try it here: press Tab after the dot
rdd_example.

One such function is take, which lets you get a taste of what the RDD contains


In [ ]:
rdd_example.take(3)

Let's say you want to apply an operation to each element of the list


In [ ]:
def square(x):
    return x**2

Now we can apply that transformation to the RDD with the map function


In [ ]:
rdd_result = rdd_example.map(square)

You might notice that this returns immediately. This is because operations on RDDs are lazily evaluated


In [ ]:
type(rdd_result)

So rdd_result is another RDD


In [ ]:
rdd_result.id()

In fact, there is no duplication of data. Spark builds a computational graph that keeps track of dependencies and recomputes results if something crashes.
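
We can peek at that dependency graph (the lineage) with toDebugString; the exact output format depends on the Spark version:


In [ ]:
# show the lineage that Spark tracks for this RDD
rdd_result.toDebugString()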

We can take a look at the contents of the results by using take again. Since take is an action, it will trigger a job in the Spark cluster


In [ ]:
rdd_result.take(3)

In [ ]:
rdd_result.count()

In [ ]:
rdd_result.first()

Usually, once you have your results, you write them back to Hadoop for later processing, because they often won't fit in memory.


In [ ]:
# this function can save into HDFS using Pickle (Python's internal) format
# the output path below is a placeholder; point it at your own HDFS location
rdd_result.saveAsPickleFile("hdfs:///path/to/rdd_result")

Spark's DataFrame

Now, DataFrames have some structure. Again, you can create them from different sources. DataFrame functionality is available from another context called the sqlContext, which gives us access to SQL-like transformations.

In this example, we will use the sklearn diabetes dataset again


In [ ]:
from sklearn.datasets import load_diabetes
import pandas as pd

In [ ]:
diabetes_ds = load_diabetes()

To create a dataset useful for machine learning we need to use certain datatypes


In [ ]:
from pyspark.mllib.regression import LabeledPoint

In [ ]:
from pyspark.ml.linalg import Vectors

In [ ]:
Xy_df = sqlContext.createDataFrame(
    [[float(l), Vectors.dense(d)] for d, l in zip(diabetes_ds['data'], diabetes_ds['target'])],
    ["y", "features"])

In [ ]:
Xy_df

We can register the DataFrame as a temporary SQL table in Spark


In [ ]:
Xy_df.registerTempTable('Xy')

And then run queries


In [ ]:
sql_result1_df = sqlContext.sql('select count(*) from Xy')

In [ ]:
# which again is lazily executed
sql_result1_df

In [ ]:
sql_result1_df.take(1)

We can run the linear regression again, this time at large scale using the DataFrame


In [ ]:
from pyspark.ml.regression import LinearRegression

In [ ]:
lr_spark = LinearRegression(featuresCol='features', labelCol="y")

In [ ]:
# note: lr_spark is just the unfitted estimator, so it has no coefficients yet;
# only the fitted model returned by fit does
lr_spark.coefficients

In [ ]:
lr_results = lr_spark.fit(Xy_df)

In [ ]:
lr_results.coefficients

In [ ]:
lr_results.intercept